Spatial Ecology and Macroecology

Practical - Week 1

2025-09-29

What are we going to see today?

  1. Data types
  2. Data sources
  3. Open data
  4. Exercise 1: explore data sources
  5. Exercise 2: data download and cleaning in R (quality check)

1. Data types

Data that can place a particular taxon in a particular location and time can take many forms, depending on:

  1. What they record (currency of the species’ distribution)
  2. How they are collected (method)
  3. How they are made available for others (openness)

Presence-only (PO) data

  • For example: records from museum and herbarium collections or citizen-science initiatives
  • Characteristics: usually opportunistic, single species, spatio-temporally specific, absences are unknown

PROS: huge amounts of data available, easily aggregated
CONS: often without details of effort/method, wide variation in data quality

Presence-absence (PA) data

  • For example: data from checklists, inventories, atlases, acoustic sensors, DNA sampling, or camera-trap surveys
  • Characteristics: multiple species, spatio-temporally specific, report searches that did not find the species (absences)

PROS: absences are informative, area and effort are measured
CONS: less abundant (too time-consuming), methods are species-specific
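
The difference between presence-only and presence-absence data can be illustrated with a small sketch. The survey table below is invented for illustration (the site labels and detections are hypothetical). It shows that reducing presence-absence data to presence-only records discards the information on where a species was searched for but not found.

```r
# Hypothetical presence-absence survey: one row per site x species,
# detected = 1 (found) or 0 (searched for, not found).
pa <- data.frame(
  site     = c("A", "A", "B", "B", "C", "C"),
  species  = c("Lepus europaeus", "Capreolus capreolus",
               "Lepus europaeus", "Capreolus capreolus",
               "Lepus europaeus", "Capreolus capreolus"),
  detected = c(1, 0, 1, 1, 0, 0),
  stringsAsFactors = FALSE
)

# Presence-only view: keep only the detections.
po <- pa[pa$detected == 1, c("site", "species")]

# Site C was surveyed but produced no detections, so it vanishes from
# the presence-only table and can no longer be told apart from a site
# that nobody visited.
surveyed_sites <- unique(pa$site)
po_sites       <- unique(po$site)
lost_sites     <- setdiff(surveyed_sites, po_sites)
lost_sites
```

In real presence-only sources the full list of surveyed sites is not available at all, which is why the absences are unknown rather than merely hidden.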

Repeated surveys

  • For example: monitoring schemes, repeated atlas projects
  • Characteristics: multiple species, over time, spatially defined, use a standardized protocol

PROS: standardised protocols, multiple points in time
CONS: expensive, geographically restricted, usually temporally too

Range-maps

  • For example: outlines of species distributions, IUCN ranges, field guides
  • Characteristics: single species, expert-drawn

PROS: rough estimates of the outer boundaries of areas within which species are likely to occur
CONS: large spatial and temporal uncertainties

Data can also be defined by how they were collected.

Structured

  • clear survey design (location, target) and standardised sampling protocol
  • site selection: preselected locations, sometimes stratified random
  • metadata: informs about the survey methods

Semi-structured

  • no survey design, but a minimal standardised sampling protocol
  • site selection: free
  • metadata: informs about the observation process and survey methods

Unstructured (opportunistic)

  • no survey design and no standardised sampling protocol
  • site selection: free
  • metadata: almost none

Finally, data can also be defined by how they are made available for others.

Disaggregated

  • precision is high, but completeness and representativeness are low.

Aggregated

  • precision is low, but completeness and representativeness are high.
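
The loss of precision that comes with aggregation can be sketched in a few lines of base R. The point records and the 1-degree cell size below are invented for illustration; the sketch only shows how individual coordinates collapse into grid cells, not how completeness improves.

```r
# Hypothetical disaggregated point records (WGS84 decimal degrees).
pts <- data.frame(
  species = c("sp1", "sp1", "sp2", "sp1", "sp2"),
  lon     = c(14.41, 14.57, 14.44, 17.26, 16.93),
  lat     = c(50.08, 50.06, 48.94, 49.60, 49.21),
  stringsAsFactors = FALSE
)

cell_size <- 1  # grid resolution in degrees (an arbitrary choice)

# Snap each point to the lower-left corner of its grid cell.
pts$cell_lon <- floor(pts$lon / cell_size) * cell_size
pts$cell_lat <- floor(pts$lat / cell_size) * cell_size

# Aggregated product: one row per species x cell, with a record count.
agg <- aggregate(
  n ~ species + cell_lon + cell_lat,
  data = transform(pts, n = 1),
  FUN  = sum
)
agg
```

After aggregation the two sp1 records near Prague are reported as a single cell with a count of 2, and their exact positions can no longer be recovered.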

2. Data sources

gbif.org

GBIF is an international network and data infrastructure funded by the world’s governments and aimed at providing anyone, anywhere, open access to data about all types of life on Earth.

rgbif: https://github.com/ropensci/rgbif

obis.org

OBIS is a global open-access data and information clearing-house on marine biodiversity for science, conservation and sustainable development.

robis: https://github.com/iobis/robis

ebird.org

eBird’s goal is to gather birdwatchers’ knowledge and experience in the form of checklists of birds, archive it, and freely share it to power new data-driven approaches to science, conservation and education.

auk: https://cornelllabofornithology.github.io/auk/

rebird: https://github.com/ropensci/rebird

inaturalist.org

iNaturalist is one of the world’s most popular nature apps. It allows participants to contribute observations of any organism, or traces thereof, along with associated spatio-temporal metadata.

rinat: https://github.com/ropensci/rinat

observation.org

Observation.org is a global biodiversity platform for citizen science and monitoring, established in 2004. It is mainly used in Europe.

iucnredlist.org

IUCN’s (International Union for Conservation of Nature) Red List of Threatened Species has evolved to become the world’s most comprehensive information source on the global extinction risk status of animal, fungus and plant species.

rredlist: https://github.com/ropensci/rredlist

mol.org

Map of Life assembles and integrates different sources of data describing species distributions worldwide. It is developed by the Center for Biodiversity and Global Change at Yale University.

10.1016/j.dib.2017.05.007

Chorological maps for the main European woody species is a data paper with a dataset of chorological maps for the main European tree and shrub species, put together by Giovanni Caudullo, Erik Welk, and Jesús San-Miguel-Ayanz.

UK bto.org/our-science/projects/breeding-bird-survey

USA usgs.gov/centers/eesc/science/north-american-breeding-bird-survey


BBS (Breeding Bird Survey) involves thousands of volunteer birdwatchers carrying out standardised annual bird counts on randomly-located 1-km sites. It’s part of the NBN Atlas.

bien.nceas.ucsb.edu/bien/

BIEN is a network of ecologists, botanists, and computer scientists working together to document global patterns of plant diversity, function and distribution.

rbien: https://github.com/bmaitner/RBIEN

sibbr.gov.br

SiBBr (Brazilian Biodiversity Information System) is an online platform that integrates data and information about biodiversity and ecosystems from different sources, making them accessible for different uses.

sibbr: https://github.com/sibbr

biotime.st-andrews.ac.uk

BioTime is an open access global database of assemblage time series for quantifying and understanding biodiversity change.

BioTime Hub: https://github.com/bioTIMEHub

3. Open Data

Open means anyone can freely access, use, modify, and share for any purpose.


Public doesn’t mean open

The data on the internet can be public but they are not necessarily open. They can be standard, available in open formats (e.g., csv), and yet, if they don’t have a licence, by default they are closed (all rights reserved).

3. Open Data: Licensing

Open data are licensed under open licenses. Some examples:


CC0: Public domain


CC-BY: Attribution


CC-BY-NC: Attribution - Non Commercial


CC-BY-SA: Attribution - Share Alike
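
In practice, the licence of a record usually arrives as a URL rather than a short label. The following sketch maps such URLs to the labels listed above. The helper licence_label() is hypothetical, and the URL patterns are an assumption based on the usual Creative Commons address scheme, so check them against the values in your own download.

```r
# Licence URLs in the Creative Commons address scheme (assumed form).
licence_urls <- c(
  "http://creativecommons.org/publicdomain/zero/1.0/legalcode",
  "http://creativecommons.org/licenses/by/4.0/legalcode",
  "http://creativecommons.org/licenses/by-nc/4.0/legalcode"
)

# Hypothetical helper: map a licence URL to a short label.
# The more specific patterns (by-nc, by-sa) are tested before plain "by".
licence_label <- function(url) {
  ifelse(grepl("publicdomain/zero", url, fixed = TRUE), "CC0",
  ifelse(grepl("licenses/by-nc/",  url, fixed = TRUE), "CC-BY-NC",
  ifelse(grepl("licenses/by-sa/",  url, fixed = TRUE), "CC-BY-SA",
  ifelse(grepl("licenses/by/",     url, fixed = TRUE), "CC-BY",
         "unknown"))))
}

licence_labels <- licence_label(licence_urls)

# Only the NC variant restricts commercial reuse.
allows_commercial_use <- licence_labels %in% c("CC0", "CC-BY", "CC-BY-SA")
data.frame(label = licence_labels, allows_commercial_use)
```

A record whose licence field is empty or unrecognised falls into "unknown", which should be treated as closed by default.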

3. Open Data: Data standards

Darwin Core is the internationally agreed data standard to facilitate the sharing of information about biological diversity.

dwc.tdwg.org

countryCode: The standard code for the country in which the Location occurs. Recommended best practice is to use an ISO 3166-1-alpha-2 country code. 

recordedBy: A list (concatenated and separated) of names of people, groups, or organizations responsible for recording the original Occurrence.
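
As a rough illustration of what following the standard means, the sketch below builds two invented records with Darwin Core term names as columns and runs two simple checks. The records and the observer names are hypothetical. The regular expression only tests the format of countryCode (two upper-case letters), not whether the code actually exists, and the " | " separator for recordedBy is the one recommended by Darwin Core.

```r
# Two hypothetical occurrence records using Darwin Core term names.
occ <- data.frame(
  scientificName = c("Lepus europaeus", "Myocastor coypus"),
  countryCode    = c("CZ", "Czechia"),  # the second value is not a code
  recordedBy     = c("Observer A | Observer B", "Observer C"),
  stringsAsFactors = FALSE
)

# countryCode should look like an ISO 3166-1 alpha-2 code.
occ$valid_country_code <- grepl("^[A-Z]{2}$", occ$countryCode)

# recordedBy may hold several names separated by " | ".
recorders <- strsplit(occ$recordedBy, " | ", fixed = TRUE)
occ$n_recorders <- lengths(recorders)
occ
```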

3. Open Data: Data sharing

Data that are standardized and have an open licence can be shared :)

EXERCISE 1 Explore different data sources

Imagine you want to start a project:

Choose one taxon and one data source, and try to get distribution data.

Then answer the following 4 questions:

  • What kind of data types does the source provide?
  • Which kinds of taxa does the database generally cover?
  • How accessible is the data? Can anyone download it? Restrictions?
  • What was your experience? What issues did you encounter while getting the data?

EXERCISE 2 Mammals of the Czech Republic

We will use the mammals of the Czech Republic as an example dataset. We will access the data through GBIF using R.

Some preparation before starting to code

  • Create a new project – you will use it for all your practical sessions (with code and data folders inside).

File > New project > New directory or Existing directory


1 Install and load libraries

We will always load packages into R using the package pacman.

install.packages('pacman')
library(pacman) # load
packageVersion('pacman')
[1] '0.5.1'


If you attempt to load a library that is not installed, pacman will try to install it automatically.

We will use tidyverse for the manipulation and transformation of data.

pacman::p_load(tidyverse) # Data wrangling
packageVersion('tidyverse')
[1] '2.0.0'


We will be using many functions from this collection of packages, like filter(), mutate(), and later read_csv().

We will use rgbif to download data from GBIF directly into our R session.

pacman::p_load(rgbif) # the GBIF R package
packageVersion('rgbif')
[1] '3.8.0'

We will need to get a taxon ID (taxonKey) for the Mammalia class from the GBIF backbone. For that, we will use another package called taxize.

pacman::p_load(taxize)
packageVersion('taxize')
[1] '0.9.100'

We will use sf to work with spatial data.

pacman::p_load(sf)
packageVersion('sf')
[1] '1.0.21'

We will use rnaturalearth to interact with Natural Earth and get mapping data (e.g., countries’ polygons) into R.

pacman::p_load(rnaturalearth)
packageVersion('rnaturalearth')
[1] '1.0.1'
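
Because pacman installs missing packages automatically, it can be useful to see beforehand which ones are not yet on your machine. The sketch below is one possible way to do this in base R; the variable names are made up for illustration, and the package list is simply the one used in this practical.

```r
# Packages used in this practical.
needed <- c("tidyverse", "rgbif", "taxize", "sf", "rnaturalearth")

# requireNamespace() returns TRUE or FALSE without attaching the package.
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

missing_pkgs <- needed[!installed]
if (length(missing_pkgs) > 0) {
  message("Not installed yet: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All packages are available.")
}
```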

2 Project variables

Create some variables that will be used later.

taxon <- "Mammalia"
country_code <- "CZ" # Two-letter ISO code for the Czech Republic
proj_crs <- 4326 # EPSG code for WGS84


Define the things you already know you will use later in the script. For instance, we know that we will work with mammals from the Czech Republic and that the data we will get from GBIF are in WGS84 latitude and longitude.

Get a taxon ID for the Mammalia class.

taxon_key <- get_gbifid_(taxon) %>%
  bind_rows() %>% # Transform the result of get_gbifid into a dataframe
  filter(matchtype == "HIGHERRANK" & status == "ACCEPTED") %>% # Filter the dataframe by the columns "matchtype" and "status"
  pull(usagekey) # Pull the contents of the column "usagekey"

taxon_key
[1] 359
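
To see what the filtering step does without querying GBIF, the pipeline can be mimicked on a mock object. The sketch assumes, as the pipeline above implies, that the lookup returns a named list of data frames with the columns usagekey, matchtype and status. Only the key 359 comes from the output above; the other rows are invented.

```r
# Mock of the list returned for one query; only usagekey 359 is real,
# the other rows are invented for illustration.
mock_result <- list(
  Mammalia = data.frame(
    usagekey  = c(359L, 900001L, 900002L),
    matchtype = c("HIGHERRANK", "FUZZY", "HIGHERRANK"),
    status    = c("ACCEPTED", "ACCEPTED", "SYNONYM"),
    stringsAsFactors = FALSE
  )
)

# Base R equivalent of bind_rows() %>% filter() %>% pull().
candidates <- do.call(rbind, mock_result)
keep <- candidates$matchtype == "HIGHERRANK" & candidates$status == "ACCEPTED"
mock_key <- candidates$usagekey[keep]

# Guard against zero or ambiguous matches before using the key in a query.
stopifnot(length(mock_key) == 1)
mock_key
```

The final check is worth keeping in a real script too: if the filter returns no key or several keys, later queries would silently use the wrong taxon.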

Basemap of CZ to use later for plotting or checking the dataset.

base_map <- rnaturalearth::ne_countries(
  scale = 110,
  type = 'countries',
  country = 'Czechia',
  returnclass = 'sf'
)

3 GBIF data download

Now we can use the function occ_count() to count the occurrence records that match a query. Its arguments, as listed in the function's help page (the exact list depends on the rgbif version), are:

occ_count(
  taxonKey = NULL,
  georeferenced = NULL,
  basisOfRecord = NULL,
  datasetKey = NULL,
  date = NULL,
  typeStatus = NULL,
  country = NULL,
  year = NULL,
  from = 2000,
  to = 2012,
  type = "count",
  publishingCountry = "US",
  protocol = NULL,
  curlopts = list()
)

How many occurrence records are in GBIF for the entire Czech Republic?

occ_count(country = country_code) # country code for the Czech Republic
[1] 5514784


And how many of those records are mammals?

occ_count(country = country_code,
          taxonKey = taxon_key)
[1] 15735
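
The two counts can be put in relation with a line of arithmetic. The numbers are copied from the output above and will change as GBIF grows.

```r
n_all     <- 5514784  # all occurrence records for CZ (from the output above)
n_mammals <- 15735    # mammal records for CZ (from the output above)

# Percentage of the Czech records that are mammals.
share_pct <- 100 * n_mammals / n_all
round(share_pct, 2)
```

Mammals therefore make up roughly 0.29 % of the Czech records in this snapshot.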


After this initial exploration, we are ready to download data. Whoop!

3.1 CZ mammals’ GBIF data download

To do this, we will use occ_search(). This function has many options that correspond to fields in the GBIF database (Darwin Core terms).

occ_search(
  taxonKey = NULL,
  scientificName = NULL,
  country = NULL,
  publishingCountry = NULL,
  hasCoordinate = NULL,
  typeStatus = NULL,
  recordNumber = NULL,
  lastInterpreted = NULL,
  continent = NULL,
  geometry = NULL,
  geom_big = "asis",
  geom_size = 40,
  geom_n = 10,
  recordedBy = NULL,
  recordedByID = NULL,
  identifiedByID = NULL,
  basisOfRecord = NULL,
  datasetKey = NULL,
  eventDate = NULL,
  catalogNumber = NULL,
  year = NULL,
  month = NULL,
  decimalLatitude = NULL,
  decimalLongitude = NULL,
  elevation = NULL,
  depth = NULL,
  institutionCode = NULL,
  collectionCode = NULL,
  hasGeospatialIssue = NULL,
  issue = NULL,
  search = NULL,
  mediaType = NULL,
  subgenusKey = NULL,
  repatriated = NULL,
  phylumKey = NULL,
  kingdomKey = NULL,
  classKey = NULL,
  orderKey = NULL,
  familyKey = NULL,
  genusKey = NULL,
  establishmentMeans = NULL,
  protocol = NULL,
  license = NULL,
  organismId = NULL,
  publishingOrg = NULL,
  stateProvince = NULL,
  waterBody = NULL,
  locality = NULL,
  limit = 500,
  start = 0,
  fields = "all",
  return = NULL,
  facet = NULL,
  facetMincount = NULL,
  facetMultiselect = NULL,
  skip_validate = TRUE,
  curlopts = list(),
  ...
)

This is not the best way to download data from GBIF. The best way would be to use the function occ_download(), but you will need a user account for this.


For our next practical, please create an account on GBIF.org and follow the script download_mammalsCZ_data_from_GBIF.R in the Week2_gridding_and plotting of the Practical_classes folder to get these data.

Get occurrence records of mammals from the Czech Republic.

occ_search(taxonKey = taxon_key,
           country = country_code) 
Records found [15735] 
Records returned [500] 
No. unique hierarchies [43] 
No. media records [500] 
No. facets [0] 
Args [occurrenceStatus=PRESENT, limit=500, offset=0, taxonKey=359, country=CZ,
     fields=all] 
# A tibble: 500 × 104
   key        scientificName  decimalLatitude decimalLongitude issues datasetKey
   <chr>      <chr>                     <dbl>            <dbl> <chr>  <chr>     
 1 5006879567 Ovis aries mus…            50.1             14.6 cdc,c… 50c9509d-…
 2 5007205681 Myocastor coyp…            50.1             14.4 cdc,c… 50c9509d-…
 3 5007283594 Myocastor coyp…            49.6             17.3 cdc,c… 50c9509d-…
 4 5007542247 Capreolus capr…            48.9             14.4 cdc,c… 50c9509d-…
 5 5007576064 Capreolus capr…            49.2             16.9 cdc,c… 50c9509d-…
 6 5007738330 Rhinolophus hi…            49.4             16.7 cdc,c… 50c9509d-…
 7 5007845561 Mustela nivali…            50.2             14.4 cdc,c… 50c9509d-…
 8 5008416860 Capreolus capr…            50.0             14.5 cdc,c… 50c9509d-…
 9 5036714112 Lepus europaeu…            50.0             14.4 cdc,c… 50c9509d-…
10 5036838815 Ovis aries mus…            50.1             14.6 cdc,c… 50c9509d-…
# ℹ 490 more rows
# ℹ 98 more variables: publishingOrgKey <chr>, installationKey <chr>,
#   hostingOrganizationKey <chr>, publishingCountry <chr>, protocol <chr>,
#   lastCrawled <chr>, lastParsed <chr>, crawlId <int>, basisOfRecord <chr>,
#   occurrenceStatus <chr>, lifeStage <chr>, taxonKey <int>, kingdomKey <int>,
#   phylumKey <int>, classKey <int>, orderKey <int>, familyKey <int>,
#   genusKey <int>, speciesKey <int>, acceptedTaxonKey <int>, …

By default, it will only return the first 500 records.

To get all the records, we need to specify a larger limit. Since there are 15,735 records, a limit of 16,000 is enough to retrieve them all. However, this will get super slow, so you could pick 5000 instead.

occ_search(taxonKey = taxon_key,
           country = country_code,
           limit = 16000)
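
One reason a large limit is slow is that the records are fetched page by page. The sketch below estimates the number of requests involved, assuming a per-request maximum of 300 records for the GBIF search API (the value given in the rgbif documentation at the time of writing; treat it as an assumption).

```r
n_records <- 15735  # records found (from the output above)
limit     <- 16000  # value passed to occ_search()
page_size <- 300    # assumed per-request maximum of the GBIF search API

# No more records can be fetched than exist or than the limit allows.
n_to_fetch <- min(n_records, limit)
n_requests <- ceiling(n_to_fetch / page_size)
n_requests
```

Under this assumption, the full download takes 53 requests, whereas a limit of 5000 would take 17.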

Finally, we store the result in the object mammalsCZ. We include the option to exclude records with known geospatial issues (e.g., zero coordinates, country coordinate mismatch, invalid coordinates).

mammalsCZ <- occ_search(
  taxonKey = taxon_key, # Key 359 created previously
  country = country_code, # CZ, ISO code of the Czech Republic
  limit = 16000, # Max number of records to download
  hasGeospatialIssue = FALSE # Only records without spatial issues
)

mammalsCZ <- mammalsCZ$data # The output of occ_search is a list with a data object inside. Here we pull the data out of the list.
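
Even after excluding flagged records, a few basic quality checks on the coordinates are worthwhile. The following sketch runs them on a toy table so that it works without a download. The values are invented, the bounding box for the Czech Republic is only approximate, and the 10 km uncertainty threshold is an arbitrary choice; only the column names follow the GBIF output.

```r
# Toy stand-in for mammalsCZ (values invented; column names follow GBIF).
toy <- data.frame(
  key = c("1", "2", "3", "4", "5"),
  decimalLongitude = c(14.57, 14.41, 17.26, NA, 2.35),
  decimalLatitude  = c(50.06, 50.08, 49.60, 49.20, 48.86),
  coordinateUncertaintyInMeters = c(15, 8, 26550, 31, 10),
  stringsAsFactors = FALSE
)

# Rough bounding box for the Czech Republic (approximate, an assumption).
bbox <- c(xmin = 12.0, xmax = 19.0, ymin = 48.5, ymax = 51.1)
max_uncertainty <- 10000  # metres; an arbitrary threshold

# Check 1: both coordinates are present.
has_coords <- !is.na(toy$decimalLongitude) & !is.na(toy$decimalLatitude)

# Check 2: the point falls inside the bounding box.
in_bbox <- has_coords &
  toy$decimalLongitude >= bbox[["xmin"]] &
  toy$decimalLongitude <= bbox[["xmax"]] &
  toy$decimalLatitude  >= bbox[["ymin"]] &
  toy$decimalLatitude  <= bbox[["ymax"]]

# Check 3: the reported coordinate uncertainty is acceptable.
precise <- !is.na(toy$coordinateUncertaintyInMeters) &
  toy$coordinateUncertaintyInMeters <= max_uncertainty

clean <- toy[in_bbox & precise, ]
clean$key
```

Record 3 is dropped for its large uncertainty, record 4 for its missing longitude, and record 5 because it lies outside the box. A bounding box is a coarse filter; checking against the country polygon in base_map would be stricter.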

4 Data exploration

Mammal occurrence records from the Czech Republic

Examine the dataset’s variables and their respective data types: Are they numeric, character, or boolean in nature?

glimpse(mammalsCZ)
Rows: 15,682
Columns: 228
$ key                                   <chr> "5006879567", "5007205681", "500…
$ scientificName                        <chr> "Ovis aries musimon (Pallas, 181…
$ decimalLatitude                       <dbl> 50.06399, 50.08033, 49.59528, 48…
$ decimalLongitude                      <dbl> 14.57059, 14.41240, 17.26240, 14…
$ issues                                <chr> "cdc,cdround", "cdc,cdround", "c…
$ datasetKey                            <chr> "50c9509d-22c7-4a22-a47d-8c48425…
$ publishingOrgKey                      <chr> "28eb1a3f-1c15-4a95-931a-4af90ec…
$ installationKey                       <chr> "997448a8-f762-11e1-a439-00145eb…
$ hostingOrganizationKey                <chr> "28eb1a3f-1c15-4a95-931a-4af90ec…
$ publishingCountry                     <chr> "US", "US", "US", "US", "US", "U…
$ protocol                              <chr> "DWC_ARCHIVE", "DWC_ARCHIVE", "D…
$ lastCrawled                           <chr> "2025-09-22T23:56:44.581+00:00",…
$ lastParsed                            <chr> "2025-09-23T14:13:19.446+00:00",…
$ crawlId                               <int> 560, 560, 560, 560, 560, 560, 56…
$ basisOfRecord                         <chr> "HUMAN_OBSERVATION", "HUMAN_OBSE…
$ occurrenceStatus                      <chr> "PRESENT", "PRESENT", "PRESENT",…
$ lifeStage                             <chr> "Adult", NA, NA, NA, NA, NA, "Ad…
$ taxonKey                              <int> 6165157, 4264680, 4264680, 52201…
$ kingdomKey                            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ phylumKey                             <int> 44, 44, 44, 44, 44, 44, 44, 44, …
$ classKey                              <int> 359, 359, 359, 359, 359, 359, 35…
$ orderKey                              <int> 731, 1459, 1459, 731, 731, 734, …
$ familyKey                             <int> 9614, 3240572, 3240572, 5298, 52…
$ genusKey                              <int> 9531221, 3240573, 3240573, 24409…
$ speciesKey                            <int> 2441110, 4264680, 4264680, 52201…
$ acceptedTaxonKey                      <int> 6165157, 4264680, 4264680, 52201…
$ acceptedScientificName                <chr> "Ovis aries musimon (Pallas, 181…
$ kingdom                               <chr> "Animalia", "Animalia", "Animali…
$ phylum                                <chr> "Chordata", "Chordata", "Chordat…
$ order                                 <chr> "Artiodactyla", "Rodentia", "Rod…
$ family                                <chr> "Bovidae", "Myocastoridae", "Myo…
$ genus                                 <chr> "Ovis", "Myocastor", "Myocastor"…
$ species                               <chr> "Ovis aries", "Myocastor coypus"…
$ genericName                           <chr> "Ovis", "Myocastor", "Myocastor"…
$ specificEpithet                       <chr> "aries", "coypus", "coypus", "ca…
$ infraspecificEpithet                  <chr> "musimon", NA, NA, NA, NA, NA, N…
$ taxonRank                             <chr> "SUBSPECIES", "SPECIES", "SPECIE…
$ taxonomicStatus                       <chr> "ACCEPTED", "ACCEPTED", "ACCEPTE…
$ dateIdentified                        <chr> "2025-01-04T21:37:05", "2025-01-…
$ coordinateUncertaintyInMeters         <dbl> 15, 8, 8, 31, 4038, 26550, 13, 3…
$ continent                             <chr> "EUROPE", "EUROPE", "EUROPE", "E…
$ stateProvince                         <chr> "Prague", "Prague", "Olomoucký",…
$ year                                  <int> 2025, 2025, 2025, 2025, 2025, 20…
$ month                                 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ day                                   <int> 4, 5, 3, 5, 3, 4, 2, 1, 14, 4, 8…
$ eventDate                             <chr> "2025-01-04T12:09", "2025-01-05T…
$ startDayOfYear                        <int> 4, 5, 3, 5, 3, 4, 2, 1, 14, 4, 8…
$ endDayOfYear                          <int> 4, 5, 3, 5, 3, 4, 2, 1, 14, 4, 8…
$ modified                              <chr> "2025-01-05T22:58:50.000+00:00",…
$ lastInterpreted                       <chr> "2025-09-23T14:13:19.446+00:00",…
$ references                            <chr> "https://www.inaturalist.org/obs…
$ license                               <chr> "http://creativecommons.org/lice…
$ isSequenced                           <lgl> FALSE, FALSE, FALSE, FALSE, FALS…
$ identifier                            <chr> "257396416", "257489614", "25737…
$ facts                                 <chr> "none", "none", "none", "none", …
$ relations                             <chr> "none", "none", "none", "none", …
$ isInCluster                           <lgl> FALSE, FALSE, FALSE, FALSE, FALS…
$ datasetName                           <chr> "iNaturalist research-grade obse…
$ recordedBy                            <chr> "Lioneska", "villllemo", "Václav…
$ identifiedBy                          <chr> "Lioneska", "villllemo", "Václav…
$ dnaSequenceID                         <chr> "none", "none", "none", "none", …
$ geodeticDatum                         <chr> "WGS84", "WGS84", "WGS84", "WGS8…
$ class                                 <chr> "Mammalia", "Mammalia", "Mammali…
$ countryCode                           <chr> "CZ", "CZ", "CZ", "CZ", "CZ", "C…
$ recordedByIDs                         <chr> "none", "none", "none", "none", …
$ identifiedByIDs                       <chr> "none", "none", "none", "none", …
$ gbifRegion                            <chr> "EUROPE", "EUROPE", "EUROPE", "E…
$ country                               <chr> "Czechia", "Czechia", "Czechia",…
$ publishedByGbifRegion                 <chr> "NORTH_AMERICA", "NORTH_AMERICA"…
$ rightsHolder                          <chr> "Lioneska", "villllemo", "Václav…
$ identifier.1                          <chr> "257396416", "257489614", "25737…
$ http...unknown.org.nick               <chr> "lioneska", "villllemo", "vaclav…
$ verbatimEventDate                     <chr> "2025/01/04 12:09", "2025-01-05 …
$ dynamicProperties                     <chr> "{\"evidenceOfPresence\":\"organ…
$ collectionCode                        <chr> "Observations", "Observations", …
$ gbifID                                <chr> "5006879567", "5007205681", "500…
$ verbatimLocality                      <chr> "Praha-Dubeč, Česko", "Praha 1, …
$ occurrenceID                          <chr> "https://www.inaturalist.org/obs…
$ taxonID                               <chr> "340942", "43997", "43997", "421…
$ http...unknown.org.captive_cultivated <chr> "wild", "wild", "wild", "wild", …
$ catalogNumber                         <chr> "257396416", "257489614", "25737…
$ institutionCode                       <chr> "iNaturalist", "iNaturalist", "i…
$ vitality                              <chr> "alive", NA, NA, NA, "alive", NA…
$ eventTime                             <chr> "12:09:00+01:00", "18:34:08+01:0…
$ identificationID                      <chr> "580286295", "580568163", "58021…
$ name                                  <chr> "Ovis aries musimon (Pallas, 181…
$ iucnRedListCategory                   <chr> NA, "LC", "LC", "LC", "LC", "LC"…
$ projectId                             <chr> NA, NA, NA, NA, "https://www.ina…
$ informationWithheld                   <chr> NA, NA, NA, NA, NA, "Coordinate …
$ occurrenceRemarks                     <chr> NA, NA, NA, NA, NA, NA, "Pravděp…
$ distanceFromCentroidInMeters          <dbl> NA, NA, NA, NA, NA, NA, NA, 3839…
$ recordedByIDs.type                    <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ recordedByIDs.value                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ identifiedByIDs.type                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ identifiedByIDs.value                 <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ sex                                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ individualCount                       <int> NA, NA, NA, NA, NA, NA, NA, NA, …
$ samplingProtocol                      <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ vernacularName                        <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ habitat                               <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ locality                              <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ identificationVerificationStatus      <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ eventType                             <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ identificationRemarks                 <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ dataGeneralizations                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ acceptedNameUsage                     <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ type                                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ datasetID                             <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ language                              <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ accessRights                          <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ recordNumber                          <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ http...unknown.org.taxonRankID        <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ taxonConceptID                        <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ taxonRemarks                          <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ eventID                               <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ otherCatalogNumbers                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ associatedReferences                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ parentEventID                         <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ gadm                                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ associatedSequences                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ networkKeys                           <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ locationRemarks                       <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ nameAccordingTo                       <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ coordinatePrecision                   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, …
$ georeferencedBy                       <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ organismQuantity                      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, …
$ organismQuantityType                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ institutionKey                        <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ collectionKey                         <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ preparations                          <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ nomenclaturalCode                     <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ institutionID                         <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ disposition                           <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ bibliographicCitation                 <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ collectionID                          <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ http...unknown.org.language           <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ footprintWKT                          <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ http...unknown.org.modified           <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ originalNameUsage                     <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ elevation                             <dbl> NA, NA, NA, NA, NA, NA, NA, NA, …
$ elevationAccuracy                     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.b295f32e72712cfa6ea8a0c5effd02a0.   <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ fieldNumber                           <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ higherGeography                       <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ locationAccordingTo                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ georeferencedDate                     <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ georeferenceProtocol                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ verbatimCoordinateSystem              <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ previousIdentifications               <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ higherClassification                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ georeferenceSources                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.ca2af10df069f7bb136331bbf56dff0f.   <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ ownerInstitutionCode                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ materialEntityID                      <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ footprintSRS                          <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ verbatimIdentification                <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.cad7ac2bf910bf964a341390bbf671ef.   <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ locationID                            <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ georeferenceRemarks                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ http...unknown.org.recordID           <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ county                                <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ rights                                <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ http...unknown.org.recordEnteredBy    <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ organismID                            <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ municipality                          <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ georeferenceVerificationStatus        <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ establishmentMeans                    <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ acceptedNameUsageID                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ parentNameUsage                       <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ island                                <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ materialSampleID                      <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ eventRemarks                          <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ subfamily                             <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ verbatimElevation                     <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ higherGeographyID                     <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ earliestEpochOrLowestSeries           <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ latestEpochOrHighestSeries            <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ geologicalContextID                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ earliestPeriodOrLowestSystem          <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ earliestEonOrLowestEonothem           <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ latestEonOrHighestEonothem            <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ earliestEraOrLowestErathem            <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ latestEraOrHighestErathem             <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ latestPeriodOrHighestSystem           <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ earliestAgeOrLowestStage              <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ namePublishedInYear                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ lithostratigraphicTerms               <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ verbatimTaxonRank                     <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ identificationQualifier               <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ bed                                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ formation                             <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.7c6eb36e75647675f673eb64f47d6da7.   <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.66327737867e96c11e11981ee34015bd.   <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.007fcdadb6565d1ed1b72315ebd20ac6.   <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.d3cc8ac74f8baed616f2c84f07c7a233.   <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.1a97f58a971298c71c1476ba348f11e1.   <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.88b8c1395fa380ccc56dff961ff1cedd.   <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.22b9ea985ca6e925fd53129340878f10.   <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.8552d5f41be35d89cbb98e8237c026a7.   <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.86ddd89f8bcf5f519c417603edd234d6.   <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.5064b328c03546024166c533a776ce16.   <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.c751f5f745cc4d794c092ab21ff6bdf8.   <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.c6d3b644819f7bb7007a779a9b8fceec.   <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.dcdc0c2d528cb18330ad4a56856db3ed.   <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.02a1149be87e9edb0aa497e31afe07a3.   <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.56f596244647c2dcb92fa607a36aa258.   <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.2183b1b175c0d4d4062a90b12ba78265.   <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.79d2ec723f5d3fb758fc5954ff5daaef.   <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.5470bd0db78d65544c25963b4bcd7c95.   <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.77845812e5846e98ff4f49d9beb44e35.   <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.a9577f32baafa53e12b254567552af5e.   <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.bb817299a5a54aa91c408611dbb5122d.   <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.05ae6f6e29a31b3be928c05a8748e11f.   <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.97660a21f2435cce604c66c6f163e3dd.   <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.07fe054c296139a4c4b9cb56916226da.   <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.85b733a072a635dc519b3896f5cebe90.   <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.594b08c316417115e35553092ac68876.   <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.5c54f89ad01da246db5b356befb2190b.   <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.bc5ff1ae9e2bb01bb3c5fa4f651ef8fb.   <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.c0ef3a1800882fe27905a8fa3e523d4b.   <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.5eeca0ece84ccaed09ebf47ffe1adb88.   <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.d33fccc638f772e2005023df63d4028a.   <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ X.469b3883081304b435895823d3abcfc7.   <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ combinationAuthors                    <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ verbatimScientificName                <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ http...unknown.org.verbatimLabel      <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ combinationYear                       <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ canonicalName                         <chr> NA, NA, NA, NA, NA, NA, NA, NA, …

Check the data output. How many rows and columns does it have?

4 Data exploration

Mammal occurrence records from the Czech Republic

How many records do we have?

nrow(mammalsCZ)
[1] 15682


How many species do we have?

mammalsCZ %>%
  filter(taxonRank == "SPECIES") %>%
  distinct(scientificName) %>%
  nrow()
[1] 194

distinct() keeps only the unique values of a variable (here, one row per scientific name)
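As a small aside, the same count can be obtained with n_distinct(). The sketch below uses a made-up toy table (not the GBIF download) to show that both approaches should agree; the object names toy, n_slide and n_short are only illustrative.

```r
library(dplyr)

# Toy data with hypothetical values, only to illustrate the two approaches
toy <- tibble(
  scientificName = c("Lutra lutra", "Lutra lutra", "Vulpes vulpes",
                     "Myotis", "Lepus europaeus"),
  taxonRank      = c("SPECIES", "SPECIES", "SPECIES", "GENUS", "SPECIES")
)

# Approach from the slide: keep species-level rows, drop duplicates, count rows
n_slide <- toy %>%
  filter(taxonRank == "SPECIES") %>%
  distinct(scientificName) %>%
  nrow()

# Shortcut: n_distinct() counts the unique values directly
n_short <- toy %>%
  filter(taxonRank == "SPECIES") %>%
  summarise(n = n_distinct(scientificName)) %>%
  pull(n)

n_slide == n_short
```

Both versions treat two spellings of the same name as different species, so taxonomic harmonisation still matters.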

5 Data quality

Data are not ‘good’ or ‘bad’; the quality will depend on our goal.
Some things we can check:

  • Basis of the record - the specific nature of the record (type of occurrence)
  • Species names (taxonomic harmonisation)
  • Spatial and temporal information (accuracy / precision)

CoordinateCleaner: https://github.com/ropensci/CoordinateCleaner

Automated flagging of common spatial and temporal errors in data.
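To give an idea of what "flagging" means, here is a hand-rolled sketch in dplyr. It does not use the CoordinateCleaner API; it only imitates a few typical checks on a made-up table. The column names follow the GBIF download, while the toy values, the flag names and the 1900 cut-off year are assumptions chosen for illustration.

```r
library(dplyr)

# Toy records with hypothetical coordinates and years
toy <- tibble(
  id               = 1:5,
  decimalLatitude  = c(50.08, 0, 49.2, 95, 16.6),
  decimalLongitude = c(14.43, 0, 16.6, 14.0, 16.6),
  year             = c(2020, 2019, 1700, 2021, 2018)
)

flagged <- toy %>%
  mutate(
    # coordinates outside the valid range of latitude / longitude
    flag_invalid = abs(decimalLatitude) > 90 | abs(decimalLongitude) > 180,
    # records placed exactly at 0, 0 (often a placeholder for "unknown")
    flag_zero    = decimalLatitude == 0 & decimalLongitude == 0,
    # identical latitude and longitude (often a data-entry error)
    flag_equal   = decimalLatitude == decimalLongitude,
    # records older than an arbitrary cut-off year (assumption: 1900)
    flag_old     = year < 1900,
    any_flag     = flag_invalid | flag_zero | flag_equal | flag_old
  )

flagged %>% select(id, any_flag)
```

Flagging rather than deleting keeps the decision with us: we can inspect the flagged records before removing anything.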

5.1 Basic data filtering

As an example of data cleaning procedures, we will check the following fields in our dataset:

  • basisOfRecord: we want preserved specimens or human observations.
  • taxonRank: we want records at the species level.
  • coordinateUncertaintyInMeters: we want it to be smaller than 10 km (10,000 m).

5.1 Basic data filtering

  • basisOfRecord: we want preserved specimens or human observations
mammalsCZ %>% distinct(basisOfRecord)
# A tibble: 7 × 1
  basisOfRecord     
  <chr>             
1 HUMAN_OBSERVATION 
2 MATERIAL_SAMPLE   
3 PRESERVED_SPECIMEN
4 OCCURRENCE        
5 FOSSIL_SPECIMEN   
6 OBSERVATION       
7 MATERIAL_CITATION 

distinct() keeps only the unique values of a variable

5.1 Basic data filtering

  • basisOfRecord: we want preserved specimens or human observations
mammalsCZ %>%
  group_by(basisOfRecord) %>%
  count()
# A tibble: 7 × 2
# Groups:   basisOfRecord [7]
  basisOfRecord          n
  <chr>              <int>
1 FOSSIL_SPECIMEN      201
2 HUMAN_OBSERVATION  14146
3 MATERIAL_CITATION    206
4 MATERIAL_SAMPLE      105
5 OBSERVATION           22
6 OCCURRENCE            18
7 PRESERVED_SPECIMEN   984

group_by() groups the rows by the values of a variable; count() then returns the number of rows in each group
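A shorter way to get the same table is to pass the variable directly to count(). The sketch below uses a made-up toy table; sort = TRUE is an optional extra that puts the most frequent category first.

```r
library(dplyr)

# Toy data with hypothetical values
toy <- tibble(
  basisOfRecord = c("HUMAN_OBSERVATION", "HUMAN_OBSERVATION",
                    "PRESERVED_SPECIMEN", "FOSSIL_SPECIMEN")
)

# Approach from the slide: group first, then count
by_group <- toy %>%
  group_by(basisOfRecord) %>%
  count() %>%
  ungroup()

# Shortcut: count() does the grouping itself; sort = TRUE orders by n
by_count <- toy %>%
  count(basisOfRecord, sort = TRUE)

by_count
```

The shortcut returns an ungrouped table, so there is no need to call ungroup() afterwards.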

5.1 Basic data filtering

  • basisOfRecord: we want preserved specimens or human observations

Update the object by filtering on basisOfRecord to keep only records that correspond to “preserved specimens” or “human observations”.

mammalsCZ <- mammalsCZ %>%
  filter(basisOfRecord == "PRESERVED_SPECIMEN" |
    basisOfRecord == "HUMAN_OBSERVATION")

Note the use of | (OR) to filter the data. An alternative is filter(basisOfRecord %in% c("PRESERVED_SPECIMEN","HUMAN_OBSERVATION")).
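The two ways of writing the filter can be compared on a toy table (made-up values; the vector name keep is only illustrative). Both versions should return the same rows, and both drop records where basisOfRecord is missing.

```r
library(dplyr)

# Toy data with hypothetical values, including one missing entry
toy <- tibble(
  id            = 1:5,
  basisOfRecord = c("HUMAN_OBSERVATION", "FOSSIL_SPECIMEN",
                    "PRESERVED_SPECIMEN", "OBSERVATION", NA)
)

# Version 1: explicit OR between two comparisons
with_or <- toy %>%
  filter(basisOfRecord == "PRESERVED_SPECIMEN" |
           basisOfRecord == "HUMAN_OBSERVATION")

# Version 2: %in% with a vector of accepted values
keep <- c("PRESERVED_SPECIMEN", "HUMAN_OBSERVATION")
with_in <- toy %>%
  filter(basisOfRecord %in% keep)

# Same rows kept by both versions?
identical(with_or$id, with_in$id)
```

The %in% version is easier to extend: adding a third category only means adding one more element to keep.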


How many records do we have now?

nrow(mammalsCZ)
[1] 15130

5.1 Basic data filtering

  • taxonRank: we want records at the species level
mammalsCZ %>% distinct(taxonRank)
# A tibble: 6 × 1
  taxonRank 
  <chr>     
1 SUBSPECIES
2 SPECIES   
3 GENUS     
4 ORDER     
5 FAMILY    
6 CLASS     

5.1 Basic data filtering

  • taxonRank: we want records at the species level

Update the object by filtering on taxonRank to keep only records that correspond to the “species” level.

mammalsCZ <- mammalsCZ %>%
  filter(taxonRank == "SPECIES")


How many records do we have now?

nrow(mammalsCZ)
[1] 14669

5.1 Basic data filtering

  • coordinateUncertaintyInMeters: we want it to be smaller than 10 km
mammalsCZ %>%
  filter(coordinateUncertaintyInMeters >= 10000) %>%
  select(scientificName, 
         coordinateUncertaintyInMeters, 
         stateProvince)
# A tibble: 426 × 3
   scientificName                           coordinateUncertaint…¹ stateProvince
   <chr>                                                     <dbl> <chr>        
 1 Rhinolophus hipposideros (Bechstein, 18…                  26550 Jihomoravský 
 2 Myocastor coypus (Molina, 1782)                           26550 Jihomoravský 
 3 Vulpes vulpes (Linnaeus, 1758)                            26518 Plzeňský     
 4 Myocastor coypus (Molina, 1782)                           26421 Prague       
 5 Myotis myotis (Borkhausen, 1797)                          26389 Královéhrade…
 6 Myotis emarginatus (E.Geoffroy, 1806)                     26389 Královéhrade…
 7 Capreolus capreolus (Linnaeus, 1758)                      26389 Středočeský  
 8 Lutra lutra (Linnaeus, 1758)                              26550 Kraj Vysočina
 9 Barbastella barbastellus (Schreber, 177…                  26550 Jihočeský    
10 Lepus europaeus Pallas, 1778                              26421 Prague       
# ℹ 416 more rows
# ℹ abbreviated name: ¹​coordinateUncertaintyInMeters

5.1 Basic data filtering

  • coordinateUncertaintyInMeters: we want it to be smaller than 10 km

Update the object by filtering on coordinateUncertaintyInMeters to keep only records with an uncertainty below 10,000 m.

mammalsCZ <- mammalsCZ %>%
  filter(coordinateUncertaintyInMeters < 10000) # keep < 10 km; rows with NA uncertainty are dropped too


How many records do we have now?

nrow(mammalsCZ)
[1] 11945
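Note that the record count falls from 14669 to 11945 (2724 records), although only 426 records have an uncertainty of 10000 m or more. The difference is consistent with filter() dropping rows where the condition evaluates to NA, i.e. records with no stated uncertainty. Whether those records should go depends on our goal. The sketch below, on a made-up toy table, shows how to count them and how to keep them if we prefer; the object names are only illustrative.

```r
library(dplyr)

# Toy data with hypothetical values, including one missing uncertainty
toy <- tibble(
  id = 1:5,
  coordinateUncertaintyInMeters = c(50, 26550, NA, 9999, 10000)
)

# How many records have no stated uncertainty?
n_missing <- toy %>%
  filter(is.na(coordinateUncertaintyInMeters)) %>%
  nrow()

# Strict filter (as on the slide): NA rows are dropped together with >= 10 km
strict <- toy %>%
  filter(coordinateUncertaintyInMeters < 10000)

# Lenient alternative: also keep records whose uncertainty is missing
lenient <- toy %>%
  filter(is.na(coordinateUncertaintyInMeters) |
           coordinateUncertaintyInMeters < 10000)

n_missing
strict$id
lenient$id
```

Neither option is "correct" in general: the strict filter is safer for fine-scale analyses, while the lenient one keeps more data for coarse-scale questions.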

6 Basic maps

How are the records distributed?

We’ll get to this next week :)

6 Basic maps

And finally, a simple trick to produce separate maps per order.

In summary

  1. Identify a data type and source
  2. Check data-sharing agreements and licences
  3. Download data and associated metadata
  4. Check data quality (e.g., dates, spatial info, taxonomy)
  5. Clean data for purpose
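Since cleaning depends on the purpose, one option is to wrap the three filters from section 5.1 in a small function whose thresholds can be changed. This is only a sketch: clean_occurrences() and its argument names are hypothetical, and the toy table contains made-up values.

```r
library(dplyr)

# Hypothetical helper: the criteria are arguments so they can follow the goal
clean_occurrences <- function(data,
                              keep_basis = c("PRESERVED_SPECIMEN",
                                             "HUMAN_OBSERVATION"),
                              keep_rank = "SPECIES",
                              max_uncertainty_m = 10000) {
  data %>%
    filter(
      basisOfRecord %in% keep_basis,
      taxonRank %in% keep_rank,
      # rows with NA uncertainty are dropped by this condition
      coordinateUncertaintyInMeters < max_uncertainty_m
    )
}

# Toy data with hypothetical values
toy <- tibble(
  id            = 1:5,
  basisOfRecord = c("HUMAN_OBSERVATION", "FOSSIL_SPECIMEN",
                    "PRESERVED_SPECIMEN", "HUMAN_OBSERVATION",
                    "HUMAN_OBSERVATION"),
  taxonRank     = c("SPECIES", "SPECIES", "SPECIES", "GENUS", "SPECIES"),
  coordinateUncertaintyInMeters = c(100, 50, 500, 20, 26550)
)

cleaned        <- clean_occurrences(toy)                          # defaults from 5.1
cleaned_strict <- clean_occurrences(toy, max_uncertainty_m = 200) # stricter goal

cleaned$id
cleaned_strict$id
```

Keeping the criteria as arguments also documents the choices we made, which helps when the same data are reused for a different question.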

Any questions?